Part A

The dataset contains 398 rows and 9 columns.

  1. mpg has a mean of 23.5 and a standard deviation of 7.8
  2. cyl has a mean of 5.4 and a standard deviation of 1.7
  3. disp has a mean of 193.4 and a standard deviation of 104.2
  4. wt has a mean of 2970.4 and a standard deviation of 846.84
  5. acc has a mean of 15.5 and a standard deviation of 2.7
  6. yr has a mean of 76 and a standard deviation of 3.6
  7. origin has a mean of 1.5 and a standard deviation of 0.8
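Statistics like these come straight out of pandas' `describe()`. A minimal sketch with stand-in values (the real DataFrame and its column names are assumptions here):

```python
# Sketch of computing per-column summary statistics with pandas.
# df is a stand-in for the Auto MPG DataFrame; values are illustrative.
import pandas as pd

df = pd.DataFrame({
    "mpg": [18.0, 15.0, 36.0, 26.0],
    "wt": [3504, 3693, 2130, 2470],
})
stats = df.describe()          # count, mean, std, min, quartiles, max
print(stats.loc[["mean", "std"]])
```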

There are 6 missing values in hp.

About 1.51% of the values in hp are missing (6 of 398). These missing values will be replaced by the column median.
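A minimal sketch of this median imputation, assuming the data sits in a pandas DataFrame with missing values encoded as NaN (the stand-in values below are illustrative):

```python
# Replace missing hp values with the column median.
import numpy as np
import pandas as pd

df = pd.DataFrame({"hp": [130.0, np.nan, 150.0, 140.0, np.nan, 90.0]})
missing_pct = df["hp"].isna().mean() * 100       # percent of hp missing
df["hp"] = df["hp"].fillna(df["hp"].median())    # impute with the median
print(missing_pct, df["hp"].isna().sum())
```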

After imputation, there are no missing values left.

Dropping the 'yr', 'origin', 'car_name', and 'cyl' columns.
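A sketch of the column drop, assuming a DataFrame with these column names (as in the Auto MPG dataset):

```python
# Drop the columns not used for the later analysis.
import pandas as pd

df = pd.DataFrame(columns=["mpg", "cyl", "disp", "hp", "wt",
                           "acc", "yr", "origin", "car_name"])
df = df.drop(columns=["yr", "origin", "car_name", "cyl"])
print(list(df.columns))
```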

  1. mpg may have two clusters.
  2. mpg and disp have a negative correlation.
  3. mpg and hp have a negative correlation.
  4. mpg and wt have a negative correlation.
  5. The above three correlations are of similar strength.
  6. mpg and acc have only a very low correlation.

  1. Cars with fewer cylinders have lower displacement and lower weight.
  2. Cars with more cylinders have higher displacement and higher weight.
  3. Displacement and weight have a positive correlation.
  4. There are a few outliers.

  1. Cars with fewer cylinders have higher mpg and lower weight.
  2. Cars with more cylinders have lower mpg and higher weight.
  3. mpg and weight have a negative correlation.
  4. There are some outliers.

Clustering

Elbow method

There are two possible elbows: k = 2 and k = 4.
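The elbow method fits KMeans for a range of k and looks for the point where inertia (within-cluster sum of squares) stops dropping sharply. A sketch on synthetic two-cluster data, since the actual scaled features are not reproduced here:

```python
# Elbow method: inertia vs. number of clusters.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),    # synthetic cluster 1
               rng.normal(6, 1, (50, 2))])   # synthetic cluster 2

inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)
# the "elbow" is where the inertia curve flattens out
print(inertias)
```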

Linear regression
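A minimal sketch of regressing mpg on the remaining features; the stand-in data below is synthetic (the real feature matrix and the chosen coefficients are assumptions, not the report's results):

```python
# Fit mpg ~ features with scikit-learn's LinearRegression.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))                    # disp, hp, wt, acc stand-ins
y = 30 - 2 * X[:, 2] + rng.normal(0, 0.5, 100)   # mpg falls as weight rises
model = LinearRegression().fit(X, y)
print(model.coef_, model.score(X, y))            # coefficients and R^2
```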

K-means clustering
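A sketch of K-means with k = 2 (one of the two elbow candidates), again on synthetic data standing in for the scaled car features:

```python
# Fit KMeans with k=2 and inspect the cluster assignments.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (60, 3)),
               rng.normal(5, 1, (60, 3))])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_                  # cluster label for each row
print(np.bincount(labels))           # cluster sizes
```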

Part B

Null values

There are a few missing values.

There are no duplicate rows.

The plot shows that the first six components explain about 95% of the variance, and the first five capture more than 91% of the information. We can therefore drop the components from the 7th onwards.
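Choosing the number of components comes down to reading the cumulative explained variance from PCA. A sketch on synthetic 18-column data (the actual cutoff of 6 components holds for the report's data, not necessarily for this stand-in):

```python
# Cumulative explained variance from PCA, and the smallest number of
# components reaching a 95% threshold.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 18))
X[:, 1] = 2 * X[:, 0] + rng.normal(0, 0.1, 200)   # correlated columns

pca = PCA().fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)
n_keep = int(np.searchsorted(cumvar, 0.95)) + 1   # components for >=95%
print(n_keep, cumvar)
```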

Training SVM

Both models achieve more than 90% accuracy on the test data. The PCA model reached 90%+ accuracy using only 6 components, whereas the model without PCA needed all the variables to do the same. The difference would be more pronounced if the dataset suffered from the curse of dimensionality; since the original data has only 18 variables, the difference here is subtle.
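The comparison above can be sketched as follows, using synthetic classification data in place of the real 18-variable set (dataset, labels, and the 6-component cutoff are stand-ins):

```python
# Compare an SVM on all features vs. on the first 6 principal components.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=18,
                           n_informative=6, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

acc_full = SVC().fit(Xtr, ytr).score(Xte, yte)     # all 18 features

pca = PCA(n_components=6).fit(Xtr)                 # fit PCA on train only
acc_pca = SVC().fit(pca.transform(Xtr), ytr).score(pca.transform(Xte), yte)
print(acc_full, acc_pca)
```

Fitting PCA on the training split only (and merely transforming the test split) avoids leaking test information into the components.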